Acta Psychiatrica Scandinavica — Latest Matching Preprints

1

Artificial intelligence for detecting bipolar disorder in electronic health records of patients with affective diagnoses: a diagnostic accuracy study

Ferro, E.; Gomez-Puentes, A. M.; Castano-Villegas, N.; Monsalve Barrientos, K.; Torres-Delgado, C.; Ortiz, L.; Esteban Cardenas, M. F.; Zea, J.

2026-05-10 psychiatry and clinical psychology 10.64898/2026.05.07.26352679 medRxiv

Top 0.1%

12.2%

Show abstract

BackgroundBipolar disorder (BD) is frequently underdiagnosed, particularly in patients presenting with depressive disorders, leading to delays in appropriate treatment. Artificial intelligence (AI) applied to electronic health records (EHRs) may improve early detection by identifying clinically relevant symptom patterns. ObjectiveTo evaluate the diagnostic performance of a natural language processing (NLP)-based AI model for detecting BD-related features in EHRs of patients with affective diagnoses. MethodsA retrospective diagnostic accuracy study was conducted using 500 EHRs from a psychiatric referral hospital in Bogota, Colombia (2020-2024). The model extracted 18 predefined clinical domains from unstructured text and classified patients into four risk categories. Diagnostic performance was assessed in a validation subset of 100 records using independent psychiatric evaluation as the reference standard. Sensitivity, specificity, positive and negative predictive values, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) were calculated. ResultsThe model achieved high agreement in symptom extraction (mean 91.1%). Sensitivity was 96.4% (95% CI: 87.7%-99.0%) and specificity was 84.4% (95% CI: 71.2%-92.3%), with an F1-score of 0.92 and an AUC-ROC of 0.932 (95% CI: 0.881-0.975). A substantial proportion of patients with depressive diagnoses were identified as having confirmed BD or clinically relevant risk. The model analyzed complete EHRs 120 times faster than human reviewers. ConclusionsNLP-based analysis of EHRs can achieve clinically meaningful performance in identifying BD-related patterns while substantially reducing review time. The model may be useful as a clinical decision support tool for earlier identification of bipolar disorder.

2

An independent supervisory safety agent improves reaction of large language models to suicidal ideation

Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.

2026-04-15 psychiatry and clinical psychology 10.64898/2026.04.13.26350757 medRxiv

Top 0.1%

8.2%

Show abstract

BackgroundLarge language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. MethodsWe conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. ResultsThe supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio {approx}83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. ConclusionsNative LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.

3

Structured large language model extraction of clinical factors from electronic health record text supports scalable psychiatric severity prediction

Stephenson, C.; Camassa, A.; Wagner, M.; Shirazi, A. H.; Alavi, N.; Omrani, M.

2026-05-13 psychiatry and clinical psychology 10.64898/2026.05.11.26352839 medRxiv

Top 0.1%

7.2%

Show abstract

BackgroundMental health systems face escalating demand that exceeds clinician capacity, making accurate severity-based triage a critical bottleneck. Severity assessment guides treatment intensity, resource allocation, and risk management, yet most clinically relevant information remains embedded in unstructured electronic health record (EHR) narratives, limiting its utility for scalable decision support. ObjectivesThis study evaluates whether a single large language model (LLM) can autonomously extract clinical factors from psychiatric EHR narratives, derive predictive weights from those factors, and use the resulting structured representation to predict clinician-implied severity at scale. MethodsFrom a Mayo Clinic repository of more than 2.7 million encounters, 15,000 de-identified psychiatric notes were sampled into a 5,000-patient discovery cohort and a 10,000-patient replication cohort. The same LLM (Llama 3 8B Instruct) extracted 17 background clinical factors and 3 treatment-action factors from each note. Severity reference labels were derived from the treatment-action factors using pre-specified clinical criteria. The LLM independently derived two factor-weight dictionaries from the discovery cohort: one capturing risk-oriented predictors of severe presentations and one capturing protective predictors. Five weighting conditions were then evaluated against the severity labels: the two LLM-derived dictionaries, two controls (LLM-derived variables with randomized weights; clinically irrelevant variables with arbitrary weights), and an unweighted zero-shot baseline. Performance was assessed across 928 valid iterations in the replication cohort. ResultsLLM-derived structured conditions significantly outperformed all controls and the baseline, with statistically equivalent performance between the two structured conditions. Improvements in precision and recall were balanced, indicating gains in discriminative capacity rather than threshold shifts. The variables and weights the LLM derived as predictors of severe presentations aligned closely with established clinical determinants of psychiatric severity. ConclusionA single LLM can derive clinically meaningful factor weights from unstructured EHR narratives and use them to predict psychiatric severity at scale, supporting a viable path toward interpretable, scalable triage in resource-constrained mental health systems.

4

Predicting clozapine initiation among patients with schizophrenia via machine learning trained on electronic health record data

Perfalk, E.; Damgaard, J. G.; Danielsen, A. A.; Ostergaard, S. D.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351083 medRxiv

Top 0.1%

6.4%

Show abstract

Background and HypothesisClozapine is the only medication with proven efficacy for treatment-resistant schizophrenia, yet many patients experience delays of several years before initiation. Our aim was to develop and validate a dynamic prediction model for clozapine initiation among patients with schizophrenia trained solely on electronic health record (EHR) data from routine clinical practice. Study DesignEHR data from all adults ([≥] 18 years) with a schizophrenia (ICD10: F20) or schizoaffective disorder (ICD10: F25) diagnosis who had been in contact with the Psychiatric Services of the Central Denmark Region between 1 January 2013 and 1 June 2024 were retrieved. 179 structured predictors were engineered (covering, e.g.,diagnoses, medications, coercive measures) and 750 predictors derived from clinical notes. At every psychiatric hospital visit, we predicted if an incident clozapine prescription occured within the next 365 days. XGBoost and logistic regression models were trained on 85% of the data with 5-fold stratified cross-validation. Performance was evaluated on the remaining 15% of the data (held out) using the area under the receiver operating characteristic curve (AUROC). Study ResultsThe training/test set comprised of 194,234/35,527 hospital visits, distributed on 4928/878 unique patients. In the test set, the best XGBoost model achieved an AUROC of 0.81, sensitivity of 32%, positive predictive value of 23% at a 7.5% predicted positive rate. ConclusionsA dynamic prediction model based solely on EHR data predicts clozapine initiation with high discrimination. If implemented as a clinical decision support tool, this model may guide clinicians towards more timely initiation of clozapine treatment.

5

A twin-aware multimodal deep learning framework with optimized late fusion for early prediction of adolescent anxiety disorder

Taosif, M.; Chaman, U. M.; Prova, N. A.; Taher, S. M.; Alam, M. G. R.; Rahman, R.

2026-03-16 psychiatry and clinical psychology 10.64898/2026.03.13.26348360 medRxiv

Top 0.1%

6.3%

Show abstract

Mental health related problems in adolescents are not always properly evaluated because of incomplete evaluation methods that do not combine biological, behavioral, and demographic details. Therefore, our study proposes a twin-aware multimodal deep learning framework applied to the QTAB dataset for early prediction of adolescent anxiety disorders. We employ a 3D convolutional neural network for neuroimaging data and prototype-based learning modules with residual encoders for behavioral and phenotypic data. Each modality-specific encoder learns compact representations optimized for class-imbalanced prediction through multi-loss objective functions. Calibrated probability outputs from the three modules are combined via optimized weighted late fusion. The framework achieves an AUC of 0.8935 (95% CI: 0.792-0.969), representing an absolute gain of 11 percentage points over the best unimodal baseline (questionnaire: AUC = 0.7766), with a sensitivity of 85.7% and a specificity of 87.3%. Pairwise statistical testing indicated that the classification patterns of the fusion model differ significantly from the questionnaire-only baseline (McNemar p = 0.0008), though AUC differences did not reach statistical significance at this sample size (DeLong p > 0.05). The best fusion weights were 23% MRI, 63% questionnaire, and 14% phenotypic, highlighting the dominant role of behavioral data. These results demonstrate that calibrated late fusion of multimodal predictions provides robust performance for early adolescent anxiety screening in twin cohorts with family-aware evaluation protocols.

6

Benchmarking Language Models for Clinical Safety: A Primer for Mental Health Professionals

Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.

2026-03-23 psychiatry and clinical psychology 10.64898/2026.03.20.26348900 medRxiv

Top 0.1%

4.9%

Show abstract

BackgroundMillions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments. AimsTo introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation. MethodWe administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants x two temperature settings x 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus. ResultsTotal scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best performing model approached the instruments measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive. ConclusionsA single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instruments remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly utilized to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI. Plain Language SummaryAI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices the person running the test made. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the tests range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.

7

Data Diversity vs. Model Complexity in the Prediction of Pediatric Bipolar Disorder: Evidence from Academic and Community Clinical Samples

Shi, Z.; Youngstrom, E. A.; Liu, Y.; Youngstrom, J. K.; Findling, R. L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.26.26349447 medRxiv

Top 0.1%

4.8%

Show abstract

Pediatric bipolar disorder is challenging to diagnose accurately due to symptom heterogeneity. More standardized and data-driven approaches are needed to enhance diagnostic reliability. We evaluated a clinical decision tool (nomogram), statistical methods (logistic regression, LASSO), machine learning (support vector machine, random forest, k-nearest neighbors, extreme gradient boosting), and deep learning model (multilayer perceptron) for pediatric bipolar disorder prediction across two datasets collected in academic (N=550) and community (N=511) clinical settings. We compared three modeling strategies: cross-dataset validation, cross-dataset with interaction terms, and mixed-dataset. We assessed model performance using discrimination ability, calibration, and predictor importance ranking. In the baseline cross-dataset approach, all models showed good internal discrimination in the academic dataset; but external discrimination in the community dataset substantially declined. Interaction-enhanced models slightly improved internal discrimination but not external performance or calibration. Recalibration prominently improved cross-dataset calibration without compromising discrimination, indicating that transportability problems were largely driven by probability scaling. Models trained on mixed datasets exhibited much stronger external discrimination and calibration. Across models and training strategies, family risk and PGBI-10M were consistently ranked as the most important predictors. Predictive models for pediatric bipolar disorder showed strong internal performance but limited cross-setting generalizability due to dataset shift and miscalibration. Increasing model complexity did not improve external performance, whereas training on pooled data substantially improved both discrimination and calibration. Findings suggest that sampling diversity, rather than model complexity, is more valuable for developing clinically useful and generalizable psychiatric prediction models, underscoring the importance of open and collaborative datasets.

8

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv

Top 0.1%

4.6%

Show abstract

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

9

Predictive value of EEG/ECG Biomarkers for Treatment Response in Depression

Provaznikova, B.; de Bardeci, M.; Altamiranda, E.; Ip, C.-T.; Monn, A.; Weber, S.; Jungwirth, J.; Rohde, J.; Prinz, S.; Kronenberg, G.; Bruehl, A.; Bracht, T.; Olbrich, S.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349315 medRxiv

Top 0.1%

4.3%

Show abstract

Objective: Major depressive episodes frequently show limited response to first-line treatments, motivating the search for objective biomarkers. EEG/ECG-based support tools aggregating electrophysiological predictors may guide treatment selection. We examined whether antidepressant treatments concordant with an EEG/ECG-biomarker report were associated with higher response rates. Methods: We retrospectively analyzed adults with ICD-10 depressive disorder or bipolar depression treated with electroconvulsive therapy (ECT), repetitive transcranial magnetic stimulation (rTMS), (es)ketamine, or selective serotonin reuptake inhibitors (SSRIs) between 2022 and 2024. Resting-state EEG with simultaneous ECG generated individualized biomarker reports with modality-specific response likelihoods. Treatment chosen by clinical teams was classified as concordant or non-concordant; response was derived from routinely collected clinical scales. Results: Among 153 patients (ECT n=53, rTMS n=48, (es)ketamine n=36, SSRIs n=16), response rates were higher for concordant vs non-concordant treatments: ECT 70% vs 50%, rTMS 30% vs 13%, (es)ketamine 31% vs 10%, and SSRIs 100% vs 11%. Overall, 46% (42/92) of concordant vs. 26% (14/54) of non-concordant patients responded (absolute difference +20 percentage points; relative increase {approx}77%; number needed to treat {approx}5). Conclusion: Concordance with EEG/ECG biomarkers correlated with higher treatment response, warranting confirmation in prospective trials. Significance: EEG/ECG-based decision support may enhance antidepressant treatment response in everyday clinical practice.

10

Developing a Tiered Machine Learning Alert System for Real-Time Suicide Risk Detection in a Digital Mental Health Setting

Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.

2026-03-30 psychiatry and clinical psychology 10.64898/2026.03.26.26349452 medRxiv

Top 0.1%

4.3%

Show abstract

The goal of this work was to leverage a large corpus of text based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in the field of natural language processing and machine learning have allowed us to include novel data sources as well as use encoding models that can represent context. Our models utilize advanced natural language processing techniques, including fine-tuned transformer models like RoBERTa, to classify risk. Subsequent model versions incorporated non-text data such as demographic features and census-derived social determinants of health to improve equitable and culturally responsive risk assessment, as well as multiclass models that can identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, ensuring a more accurate and timely intervention for clients in need.

11

Identification of Suicide-Related Subgroups Using Latent Class Analysis: Complementary Insights to Explainable AI-Based Classification

Kizilaslan, B.; Mehlum, L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349264 medRxiv

Top 0.1%

4.0%

Show abstract

Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remain unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with XAI-based suicide classification model findings suggest that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks. Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning

12

Multi-Criteria Validation of LLM-Inferred Depression Severity from Outpatient Psychiatry Notes

Cudic, M.; Meyerson, W. U.; Wang, B.; Yin, Q.; Khadse, P. N.; Burke, T.; Kennedy, C. J.; Smoller, J. W.

2026-03-12 psychiatry and clinical psychology 10.64898/2026.03.11.26348066 medRxiv

Top 0.1%

3.8%

Show abstract

BackgroundLongitudinal measurement of depression severity in outpatient psychiatric care is limited by infrequent standardized assessments. Although psychiatric clinical notes capture illness burden and functional impairment, this information is rarely quantified for analysis. ObjectiveTo evaluate whether large language models (LLMs) can infer clinically meaningful measures of depression severity from outpatient psychiatry notes. MethodsWe sampled 91,651 outpatient psychiatry notes from 8,287 adult patients across 58 clinics within a large academic medical center between 2015 and 2021. A HIPAA-compliant LLM (OpenAI GPT-5.2) was prompted to independently estimate three depression severity scores (Patient Health Questionnaire-9 [PHQ-9], Hamilton Depression Rating Scale [HAM-D], and depression-specific Clinical Global Impression-Severity [CGI-S]) from notes, with patient-reported PHQ-9 content within notes redacted to prevent biasing. Convergent validity was assessed against patient-reported PHQ-9 (n=3,757), study-clinician chart review (n=125), and treating-clinician suicide risk assessments (SRA; n=2,985). Predictive validity was evaluated using survival models of antidepressant switching and psychiatric emergency visits. Discriminant validity across diagnoses and consistency across demographic groups and clinics were also evaluated. Results10.8% of eligible visits had a PHQ-9 recorded within 7 days before the encounter. LLM-inferred PHQ-9 scores showed moderate agreement with patient-reported PHQ-9 (Cohens {kappa}=0.64, 95%CI:0.62-0.66; Pearson r=0.67, 95%CI: 0.65-0.68). Stronger agreement was found between LLM CGI-S and study-clinician chart review ({kappa}rater1=0.79, 95%CI: 0.70-0.85; {kappa}rater2=0.67, 95%CI: 0.58-0.77; r=0.86 with mean rating, 95%CI: 0.80-0.90). In prospective analyses, LLM CGI-S predicted antidepressant switching (C-index=0.60; CI95%: 0.58-0.62) and psychiatric emergency visits (C-index=0.63; 95%CI: 0.57-0.68), which was comparable to the predictive performance of patient-reported PHQ-9 and treating-clinician SRA. Correlations between LLM CGI-S and patient-reported PHQ-9 were consistent across clinics (I2<0.1) but significantly lower among Black (r=0.48, 95%CI: 0.38-0.57) and Hispanic (r=0.43, 95%CI: 0.27-0.56) patients. ConclusionsLLM-inferred depression severity scores from psychiatric outpatient notes support longitudinal, standardized phenotyping of depression severity, such as for routine outcome monitoring. These results have implications for facilitating genetic, pharmacoepidemiologic, and antidepressant treatment effectiveness studies using real-world evidence.

13

Evaluating Large Language Models for Assessment of Psychosis Risk

Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.

2026-04-04 psychiatry and clinical psychology 10.64898/2026.04.02.26349960 medRxiv

Top 0.1%

3.6%

Show abstract

Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P) remains limited, constraining preventive care. The effectiveness of the CHR-P state is constrained, in part due to clinical assessments requiring specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICCsev = 0.74, ICCfreq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.

14

A computational decision-support approach for personalised care in youth mental health: A pilot feasibility study protocol

Iorfino, F.; Turner, A.; Varidel, M.; de Haan, Z.; Roberts, A. E.; Zhang, T.; An, V.; Huntley, S.; Marchant, R.; Crouse, J. J.; Cripps, S.; Barakat, S.; Maguire, S.; Oliver, D.; Scott, E. M.; Thornton, L.; Robinson, J.; LaMonica, H. M.; Hickie, I. B.

2026-05-15 psychiatry and clinical psychology 10.64898/2026.05.12.26353058 medRxiv

Top 0.1%

3.5%

Show abstract

Introduction: Youth mental health presentations are largely heterogenous, making it difficult to match individuals to the most appropriate interventions. Personalised, measurement-based care has the potential to improve clinical decision-making and support shared decision-making, but remains challenging to implement in routine practice. Advances in digital monitoring and causal modelling offer new opportunities to identify individual-level processes driving mental health difficulties and to generate personalised decision-support. This pilot study aims to evaluate the feasibility and acceptability of the Minding Your Mind computational decision-support approach, a newly developed approach integrating routine outcome monitoring, individual-level causal modelling, and personalised feedback to support shared decision-making between young people and their clinicians. Methods and analysis: The study involves two phases. Phase 1 will recruit young people aged 15-25 years and mental health clinicians to participate in workshops to co-design the decision-support approach and its implementation into routine practice. Phase 2 is a prospective, single-arm feasibility study involving young people receiving mental health care and their treating clinicians. Primary outcomes include feasibility, acceptability, appropriateness, and usability of the decision-support approach, assessed via self-report and objective process indicators. Secondary outcomes include changes in use and experiences with shared decision-making, and clinical and functional outcomes. Quantitative analyses will be primarily descriptive, with exploratory pre-post comparisons and sensitivity analyses. Qualitative interviews will explore user experiences and implementation barriers and facilitators. Ethics and dissemination: This study has been approved by the Sydney Local Health District (RPAH Zone) Human Research Ethics Committee (X25-0341). All participants will provide informed consent prior to participation. Findings will be disseminated through peer-reviewed publications, conference presentations, and accessible summaries co-developed with young people with lived experience.

15

Speech-Based Markers in Paediatric ADHD: A Longitudinal Case-Control Study of Voice Features and Medication Effects

Bamberger, R.; Kuhles, G.; Lotter, L. D.; Dukart, J.; Konrad, K.; Guenther, T.; Siniatchkin, M.; Fuchs, M.; von Polier, G.

2026-03-31 psychiatry and clinical psychology 10.64898/2026.03.25.26348708 medRxiv

Top 0.1%

3.1%

Show abstract

Background Diagnosis and treatment monitoring of attention-deficit/hyperactivity disorder (ADHD) largely rely on subjective assessments, highlighting the need for objective markers. Voice features and speech embeddings represent promising candidates for such markers, as they may capture alterations in speech production relevant to ADHD. However, it remains unclear which speech features are most informative for distinguishing ADHD and monitoring treatment effects, and which speech tasks most reliably elicit such differences. Methods Twenty-seven children with ADHD and 27 age-matched neurotypical controls completed six speech tasks across two study visits. Children with ADHD were unmedicated at baseline (first visit) and were assessed under prescribed methylphenidate treatment at follow-up, whereas controls underwent repeated assessment without intervention. Established acoustic voice features (eGeMAPS) and high-dimensional speech embeddings (WavLm, Whisper) were extracted and analysed using linear mixed models to examine baseline group differences and group-by-time interaction effects reflecting medication-associated change patterns. Results At baseline, children with ADHD differed significantly from controls in frequency, spectral, and temporal voice features, characterized by lower and more variable pitch, altered spectral properties, and reduced rhythmic stability. Group-by-time interaction effects indicated medication-associated modulation in the ADHD group, including reduced loudness variability and increased precision of vowel articulation at follow-up, changes not observed in controls. Speech embeddings revealed additional baseline and interaction effects beyond established acoustic features. Free speech tasks, particularly picture description, yielded the most robust and consistent effects. Conclusion Children with ADHD differed from neurotypical controls in vocal features at baseline and showed distinct longitudinal change patterns consistent with medication-related change. These findings support further investigation of speech-based measures as candidate digital phenotypes and potential digital biomarkers in ADHD, with picture description emerging as a particularly promising task for future clinical assessment protocols.

16

Consistency of Linguistic and Cognitive Processing Measures to Discriminate Children with and without Developmental Language Disorder (DLD): Comparing Likelihood Ratios (LHs) and Elastic Net Regression Computational Models.

Sharma, S.; Golden, R. M.; Montgomery, J. W.; Gillam, R. B.; Evans, J.

2026-03-09 psychiatry and clinical psychology 10.64898/2026.03.09.26347082 medRxiv

Top 0.1%

3.0%

Show abstract

Because both monothetic and polythetic diagnostic classification approaches focus on the presence of individual symptom(s) to identify individuals in a clinical population, they may be diagnostically sensitive clinical markers of multidimensional disorders such as developmental language disorder (DLD). DLD researchers have also used likelihood ratios (LHs) to identify possible diagnostic clinical markers of DLD, however the diagnostic sensitivity of LHs varies markedly across studies. A recent multidimensional computational elastic-net regression examined a total of 71 measures of spoken language and cognitive processing from a cohort of 223 children ages 7;0 to 11;0 with and without DLD (DLD = 110; typically developing (TD) controls = 113). All 200 iterations of the model had high discriminative power (87% - 88%) in positively identifying and distinguishing the DLD participants across all thresholds. Notably, the models identified a sparse DLD-specific deficit profile which only included nine of the 71 measures. In this study, we ask if the individual LHs for each of these nine measures are equally sensitive in identifying and discriminating the children with DLD from TD controls or if diagnostic markers of multidimensional disorders such as DLD can only be identified based on computational modeling approaches. The LHs for each of the nine measures were in the moderately high ranged (3.25 - 10). However, at the the highest LH cut points for each measure, there was little to no overlap in the children each measure identified as having DLD. Follow up analysis revealed that the elastic net model-derived predictive scores for each participant were significantly correlated with the participants language ability. The model also identified a subgroup of TD participants as having the same DLD-deficit profile as the DLD participants. This subgroup were younger, predominantly male participants whose standardized language assessment scores were lower as compared to the larger TD cohort. Taken together, the results from this study show that, because multidimensional modeling approaches such as elastic net regression leverage the variability in the deficit profiles across individual members of a diagnostic group and the unique contributions of each of the behavioral features of the phenotype, they may be an effective tool in deriving diagnostically specific deficit profiles for phenotypically complex, multicausal, multidimensional, neurodevelopmental disorders such as DLD. The results also demonstrate the robustness of the derived DLD-specific deficit profile in identifying individuals with "mild" or subclinical DLD, demonstrating the potential utility of this approach in both clinical and research arenas. What this paper adds.O_ST_ABSWhat is already known on this subject.C_ST_ABSThe identification of diagnostic markers for DLD has been a challenge for both clinicians and researchers across multiple decades. Monothetic classification markers such as non-word repetition, optional infinitive, or syntax dependencies have been explored, as well as polythetic classification approaches where a list of diagnostic symptoms is used together. However, each assumes different criteria and symptoms that should be included as diagnostic markers of DLD. What this study adds.Our study assessed the feasibility and effectiveness of monothetic vs. polythetic classification approaches for identifying DLD. Since our prior work, which used elastic net logistic regression computational modeling with strong discriminatory power, consistently selected nine key features as the DLD-deficit profile, in this effort, we calculated each of the nine features likelihood ratios to examine each measures ability to identify children with DLD. The monothetic approach failed to identify a consistent set of children with DLD, and the polythetic classification approach also did not identify participants who were shown to have mild DLD by the elastic net modeling approach. Instead, our analysis showed that a computational modeling approach, such as elastic net regression, that included small but important input from multiple cognitive and linguistic aspects of children, could better capture multifaceted information about the disorder, better account for individual variability, and consistently identify most participants with DLD. Clinical implications of this study.Elastic net logistic regression identifies a small subset of important features for distinguishing DLD and can assign a probability of DLD presence for each participant. Instead of the polythetic and monothetic approaches commonly used in the field, our study shows that integrating advanced computational modeling, such as elastic net regression, with clinician judgment can better refine assessment processes and address prior and ongoing inconsistencies in the DLD literature and diagnostic practices.

17

Stratifying the risk of transition to adult-onset psychiatric disorders in adolescents with anxiety

Dennison, C. A.; Shakeshaft, A.; Riglin, L.; Rice, F.; Andreassen, O.; Ask, H.; Havdahl, A.; Pine, D.; Martin, J.; Thapar, A.

2026-05-21 psychiatry and clinical psychology 10.64898/2026.05.15.26353293 medRxiv

Top 0.1%

2.9%

Show abstract

Background Escalating mental health service demands have created a need to better identify young people most likely to require continued support from mental health services at the transition between childhood and adulthood. Anxiety is the most common adolescent mental health condition, yet its clinical significance and prognosis are not well understood. We aimed to examine the risk of young adult-onset psychiatric disorders in individuals with an adolescent anxiety disorder, and identify stratifiers of risk of subsequent psychiatric disorders in this group. Methods Individuals from the Norwegian Mother, Father, and Child Cohort Study (MoBa) with linked health records and aged 18 or over as of the 31st December 2023 were included. Those diagnosed with any ICD-10 anxiety disorder when aged 10-17 years were defined as having an adolescent anxiety disorder (n=2107, controls n=47,582). Polygenic scores (PGS) for psychiatric and neurodevelopmental conditions were calculated using LDpred2. Anxiety, comorbidities, and parental psychiatric history were defined through linked ICD-10 diagnoses. Sex was defined through linked records. Individuals were defined as having a young adult-onset psychiatric disorder if they first received any new psychiatric diagnosis aged 18-24. Results Adolescent anxiety diagnosis was associated with increased risk of all adult-onset psychiatric disorders (HR= 2.33-8.65). Post-traumatic stress disorder PGS, parental history of severe mental illness, and female sex were associated with increased risk of transition to a young adult-onset psychiatric disorder in people with an adolescent anxiety disorder. Conclusions Adolescent anxiety greatly increases the risk of a psychiatric disorder during the transition to adult life. Clinicians should consider female sex and parental psychiatric history when prioritising young people with anxiety for adult mental health service support. Future research needs to further consider whether polygenic scores would aid risk stratification in clinical practice.

18

Compact longitudinal representations derived from mixed-format lifestyle questionnaires outperform static text-derived features for ALS-versus-control classification

Radlowski Nova, J.; Lopez-Carbonero, J. I.; Corrochano, S.; Ayala, J. L.

2026-03-25 bioinformatics 10.64898/2026.03.23.713709 medRxiv

Top 0.1%

2.9%

Show abstract

BackgroundMixed-format lifestyle questionnaires contain both structured variables and free-text responses, but it remains unclear whether language-derived variables provide incremental predictive value beyond structured data, and under which representational condition. It was investigated whether variables derived from patient-reported free text improve ALS-versus-control classification beyond structured questionnaire data, and whether their value depends on how temporal information is represented. MethodsA leakage-free machine-learning pipeline was developed to classify ALS versus controls from questionnaire-derived data, including a schema-guided LLM-based text-to-table extraction and a compact longitudinal encoding strategy. Three feature configurations were compared: Pool1, containing structured baseline variables only; Pool2, adding compact summaries derived from first-time-point (T1) free-text responses; and Pool3, further incorporating compact descriptors of change between T1 and T2. Logistic Regression, linear Support Vector Classification, and Random Forest were evaluated using repeated stratified holdout (10 seeds) and repeated stratified 5-fold cross-validation. Final ablation analyses were performed to isolate the contribution of the compact text block and the compact temporal block. ResultsAfter leakage correction, performance estimates became more conservative, indicating that previous results had been optimistic. In the final configuration, Pool3 achieved the best performance, with Random Forest reaching a holdout accuracy of 0.673, F1-weighted score of 0.666, and Matthews correlation coefficient of 0.323; cross-validated F1-weighted score and Matthews correlation coefficient were 0.654 and 0.312, respectively. Pool2 did not show a robust improvement over Pool1. Ablation analysis showed that removing the compact temporal block markedly reduced Pool3 performance, whereas removing the compact text block had little overall effect. These findings indicate that the primary value of language-based processing in small clinical cohorts lies not in static feature enrichment, but in enabling compact representations of longitudinal change. ConclusionsIn this setting, the main predictive gain did not arise from static text-derived variables alone, but from representing questionnaire information as compact longitudinal change descriptors. These findings suggest that, in small clinical cohorts, the value of language-based processing may lie more in summarizing trajectories than in expanding static feature spaces.

19

Comparative effectiveness of preferred pharmacological treatment options for bipolar disorder among people with opioid use disorder in British Columbia and Ontario, Canada: protocol for parallel population-based target trial emulations

Hossain, M. B.; Yan, R.; Morin, K. A.; Rotenberg, M.; Russolillo, A.; Solmi, M.; Lalva, T.; Marsh, D. C.; Nosyk, B.

2026-04-03 psychiatry and clinical psychology 10.64898/2026.04.02.26350000 medRxiv

Top 0.1%

2.7%

Show abstract

Introduction People with bipolar disorder (BD) and concurrent opioid use disorder (OUD) experience more severe clinical outcomes, including higher mortality, treatment complexity, and worse psychiatric symptoms, yet they are underserved due to a lack of tailored clinical guidelines and limited supporting research on competing treatment options. While pharmacological treatments for BD are well-established, their use varies widely across settings, and their effectiveness in individuals with co-occurring OUD is unclear. We propose parallel population-based studies to emulate randomized controlled trials to assess the comparative effectiveness of pharmacological treatment options for BD among people with OUD in British Columbia and Ontario, Canada, 2010-2023. Methods and analysis We propose emulating a series of parallel target trials using linked population-level health administrative data for all individuals aged 18 years or older diagnosed with both BD and OUD and who initiated treatments for BD between 1 January 2010 and 31 December 2023. All analyses will be conducted in parallel in British Columbia and Ontario. We propose a series of four successive target trial emulations, comparing (i) lithium versus non-antipsychotic mood stabilizers such as divalproex, lamotrigine, and valproic acid; (ii) lithium versus 2nd generation antipsychotics with mood stabilizing properties such as risperidone, olanzapine, aripiprazole, and quetiapine; (iii) lithium versus combination treatments such as lithium and divalproex, lithium and olanzapine, lithium and aripiprazole, lithium and quetiapine, divalproex and olanzapine, and olanzapine and quetiapine; (iv) lithium and valproate (LATVAL) versus lithium and olanzapine, lithium and aripiprazole, lithium and quetiapine, divalproex and olanzapine, and olanzapine and quetiapine. Incident user and prevalent new user analyses are planned for proposed target trials (i)-(iv), pending sufficient data. Stratified analyses will be conducted for BD-I, manic and depressive phases of BD illness. We propose an initiator analysis (intention-to-treat, conditional on medication dispensation) to determine the effectiveness of the treatments and per-protocol analyses to determine the efficacy of the treatments after dealing with treatment switching and recommended dose adjustment. The outcomes will include psychiatric acute-care visits (hospitalizations and emergency department visits), BD treatment discontinuation and all-cause mortality. Subgroup and sensitivity analyses, including cohort and study timeline restrictions, eligibility criteria modifications, and outcome reclassifications, are proposed to assess the robustness of our results. Executing analyses in parallel across settings using a co-developed protocol will allow us to evaluate the replicability of findings. Ethics and dissemination The protocol, cohort creation, and analysis plan have been classified and approved as a quality improvement initiative by the Providence Health Care Research Ethics Board and the Simon Fraser University Office of Research Ethics. Results will be disseminated to local advocacy groups, clinical groups and decision-makers, national and international clinical guideline developers, presented at international conferences, and published in peer-reviewed journals.

20

Mystical Experience Induced by Esketamine Treatment: A Real-World Observational Study

Mallevays, M.; Fuet, L.; Danon, M.; Di Lodovico, L.; Jaffre, C.; Bouzeghoub, L.; Mrad, S.; Rousselet, A.-V.; Allary, L.; Muh, C.; Vissel, B.; De Maricourt, P.; Vinckier, F.; Gaillard, R.; Mekaoui, L.; Gorwood, P.; Petit, A.-C.; Berkovitch, L.

2026-04-01 psychiatry and clinical psychology 10.64898/2026.03.31.26349757 medRxiv

Top 0.1%

2.7%

Show abstract

Esketamine is a fast-acting antidepressant drug which induces acute psychoactive effects. The most frequent is a dissociative state which seems unrelated to therapeutic efficacy. Other esketamine-induced effects, including psychedelic-like mystical experiences, have been poorly studied in terms of phenomenology and frequency, and may carry specific therapeutic relevance. In this study, we characterised esketamine-induced mystical experiences in relation with clinical outcomes. We conducted a longitudinal observational study and systematically measured acute subjective effects in patients receiving esketamine for treatment-resistant depression after each administration across the induction phase. A total of 45 patients were included, from two independent centres, totalling 352 esketamine administrations. Principal Component Analysis (PCA) supported the validity of the Mystical Experience Questionnaire (MEQ-30) for assessing esketamine-induced subjective effects, with components recovering dimensions previously validated with classic psychedelics. Mystical experiences (MEQ-30 score above 60) occurred in 58% of patients, with high inter- and intra-individual variability in frequency, intensity, and phenomenology across sessions. Higher mean and peak MEQ scores were associated with greater improvement in Montgomery-Asberg Depression Rating Scale scores from pre- to post-treatment, whereas the intensity of dissociative or other non-mystical effects was not. Positive mood and mystical MEQ dimensions in particular predicted therapeutic outcomes. Baseline spirituality also significantly predicted treatment outcomes and peak MEQ scores in the first week of treatment. These findings add to the growing body of evidence suggesting that psychedelic-like mystical experiences may be associated to therapeutic efficacy, not only in classic psychedelic-assisted therapy, but also in esketamine treatment.